Deep fragment embeddings for bidirectional image sentence mapping
Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015 ICML)
- Feature map after the front CNN: 14×14×512 (width × height × channels)
    - each 1×1×512 slice is one annotation vector $\mathbf{a}_i$
- Implementation details from checking the source code
    - $f_{att}$: 2-layer MLP
        - implementation of $f_{att}(\mathbf{a}_i, \mathbf{h}_{t-1})$
            - intermediate layer: $\mathbf{ctx} = \mathbf{W}_{D \to D}\,\mathbf{a}_i + \mathbf{b}$
            - final layer: $\mathbf{W}_{D \to 1}\left(\mathbf{ctx} + \mathbf{W}_{n \to D}\,\mathbf{h}_{t-1}\right) + b$
    - $f_{init,c}$: 2-layer MLP
    - $f_{init,h}$: 2-layer MLP
- Source Code
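The implementation details above can be sketched numerically. This is a minimal NumPy sketch, not the released Theano code: the parameter names (`W_a`, `W_h`, `W_out`, etc.), the LSTM hidden size `n = 256`, and the `tanh` nonlinearities are assumptions for illustration; the weight shapes follow $\mathbf{W}_{D \to D}$, $\mathbf{W}_{n \to D}$, and $\mathbf{W}_{D \to 1}$ from the notes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Dimensions from the notes: L = 14*14 locations, D = 512 channels.
# n (the LSTM hidden size) is an assumed value here.
L, D, n = 14 * 14, 512, 256
rng = np.random.default_rng(0)

# Hypothetical parameter names; shapes mirror W_{D->D}, W_{n->D}, W_{D->1}.
W_a   = rng.standard_normal((D, D)) * 0.01
b_a   = np.zeros(D)
W_h   = rng.standard_normal((n, D)) * 0.01
W_out = rng.standard_normal((D, 1)) * 0.01
b_out = np.zeros(1)

def f_att(a, h_prev):
    """a: (L, D) annotation vectors, h_prev: (n,) previous LSTM state.
    Returns alpha: (L,) attention weights over the feature-map locations."""
    ctx = a @ W_a + b_a                    # intermediate layer, applied per location
    pre = np.tanh(ctx + h_prev @ W_h)      # assumed tanh between the two layers
    e = (pre @ W_out + b_out).squeeze(-1)  # final layer: one scalar score per location
    return softmax(e)

def two_layer_mlp(x, W1, b1, W2, b2):
    return np.tanh(np.tanh(x @ W1 + b1) @ W2 + b2)

# f_init,c and f_init,h: 2-layer MLPs on the mean annotation vector.
W1c, b1c = rng.standard_normal((D, D)) * 0.01, np.zeros(D)
W2c, b2c = rng.standard_normal((D, n)) * 0.01, np.zeros(n)
W1h, b1h = rng.standard_normal((D, D)) * 0.01, np.zeros(D)
W2h, b2h = rng.standard_normal((D, n)) * 0.01, np.zeros(n)

a = rng.standard_normal((L, D))        # stand-in for the 14x14x512 CNN feature map
c0 = two_layer_mlp(a.mean(axis=0), W1c, b1c, W2c, b2c)
h0 = two_layer_mlp(a.mean(axis=0), W1h, b1h, W2h, b2h)

alpha = f_att(a, h0)                   # attention weights over the 196 locations
z = alpha @ a                          # soft-attention context vector
```

The weighted sum `z` is the expected context vector of soft attention; hard attention would instead sample one location from `alpha`.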
Deep Visual-Semantic Alignments for Generating Image Descriptions
DenseCap: Fully Convolutional Localization Networks for Dense Captioning